Diacritics restoration based on word n-grams for Slovak texts

Abstract

Despite the modern boom in technology, we are still faced with the fact that people write texts without diacritics. There are two main reasons for this. The first, historical reason stems from the past, when the use of diacritics was troublesome and people would write text without them. The second one is speed: typing without diacritics is usually faster. Text without diacritics is easy for people to understand, but in some types of documents, missing diacritics can cause a problem. This is also an issue when computers process such text. In this paper, we propose an algorithm based on word n-grams (a contiguous sequence of n words) to restore diacritics in texts written in the Slovak language. We compare and evaluate our results with other algorithms developed...
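To make the idea of word n-gram restoration concrete, here is a minimal Python sketch. It is not the algorithm from the paper, whose details are not given in this abstract. It assumes a greedy left-to-right choice among diacritized candidates, scored by bigram counts with a fallback to unigram counts. The function names (`strip_diacritics`, `train`, `restore`) and the two-sentence training corpus are hypothetical and chosen only for illustration.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks, e.g. 'súd' -> 'sud'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(sentences):
    """Collect unigram and bigram counts from correctly diacritized text."""
    unigrams, bigrams = Counter(), Counter()
    candidates = defaultdict(set)  # stripped form -> diacritized forms seen
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            unigrams[word] += 1
            candidates[strip_diacritics(word)].add(word)
            if i > 0:
                bigrams[(words[i - 1], word)] += 1
    return unigrams, bigrams, candidates


def restore(text, unigrams, bigrams, candidates):
    """Greedy restoration: prefer bigram evidence, fall back to unigram counts."""
    restored = []
    for word in text.lower().split():
        # Unknown words are kept unchanged (assumption of this sketch).
        options = candidates.get(strip_diacritics(word), {word})
        prev = restored[-1] if restored else None
        best = max(options, key=lambda w: (bigrams[(prev, w)], unigrams[w], w))
        restored.append(best)
    return " ".join(restored)


# Toy corpus (hypothetical): "súd" (court) and "sud" (barrel) differ only in diacritics.
model = train(["najvyšší súd rozhodol", "starý sud vína"])
print(restore("najvyssi sud rozhodol", *model))
print(restore("stary sud vina", *model))
```

In this toy setup the preceding word decides between "súd" and "sud". A realistic system would need a large corpus, longer n-grams, and smoothing for unseen word sequences.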

Similar resources

Diacritics Restoration in Romanian Texts

There are several languages that use diacritical characters outside the ASCII charset. For some of the languages, most diacritical characters can be deterministically recovered but in general, this is not the prevailing case. However, the difficulty of the task differs from language to language depending on the functional role of the diacritical characters. For Romanian, automatic restoration o...

Diacritics restoration for Arabic dialect texts

Vocalization, diacritization or diacritics restoration is one of the major challenges in Arabic natural language processing. Algiers dialect is also concerned by this issue. In this paper, we present an automatic diacritization system for standard and dialect Arabic texts based on a statistical approach. The idea is to use available tools in statistical machine translation to build such a system...

Beyond Word N-Grams

We describe, analyze, and experimentally evaluate a new probabilistic model for word-sequence prediction in natural languages, based on prediction suffix trees (PSTs). By using efficient data structures, we extend the notion of PST to unbounded vocabularies. We also show how to use a Bayesian approach based on recursive priors over all possible PSTs to efficiently maintain tree mixtures. These ...

Automatic Diacritics Restoration for Hungarian

In this paper, we describe a method based on statistical machine translation (SMT) that is able to restore accents in Hungarian texts with high accuracy. Due to the agglutination in Hungarian, there are always plenty of word forms unknown to a system trained on a fixed vocabulary. In order to be able to handle such words, we integrated a morphological analyzer into the system that can suggest a...

Comparing Word Relatedness Measures Based on Google $n$-grams

Estimating word relatedness is essential in natural language processing (NLP), and in many other related areas. Corpus-based word relatedness has its advantages over knowledge-based supervised measures. There are many corpus-based measures in the literature that cannot be compared to each other as they use a different corpus. The purpose of this paper is to show how to evaluate different corpu...

Journal

Journal title: Open Computer Science

Year: 2021

ISSN: 2299-1093

DOI: https://doi.org/10.1515/comp-2020-0143